Compiling a Partition-Based Two-Level Formalism 
Abstract

This paper describes an algorithm for the 
compilation of a two (or more) level or- 
thographic or phonological rule notation 
into finite state transducers. The no- 
tation is an alternative to the standard 
one deriving from Koskenniemi's work: 
it is believed to have some practical de- 
scriptive advantages, and is quite widely 
used, but has a different interpretation. 
Efficient interpreters exist for the nota- 
tion, but until now it has not been clear 
how to compile to equivalent automata 
in a transparent way. The present paper 
shows how to do this, using some of the 
conceptual tools provided by Kaplan and 
Kay's regular relations calculus. 

1 Introduction 

Two-level formalisms based on that introduced by (Koskenniemi, 1983) (see also (Ritchie et al., 1992) and (Kaplan and Kay, 1994)) are widely used in practical NLP systems, and are deservedly regarded as something of a standard. However, there is at least one serious rival two-level notation in existence, developed in response to practical difficulties encountered in writing large-scale morphological descriptions using Koskenniemi's notation. The formalism was first introduced in (Black et al., 1987), was adapted by (Ruessink, 1989), and an extended version of it was proposed for use in the European Commission's ALEP language engineering platform (Pulman, 1991). A further extension to the formalism was described in (Pulman and Hepple, 1993).

The alternative partition formalism was motivated by several perceived practical disadvantages to Koskenniemi's notation. These are detailed more fully in (Black et al., 1987, pp. 13-15), and in (Ritchie et al., 1992, pp. 181-9). In brief: (1) Koskenniemi rules are not easily interpretable (by the grammarian) locally, for the interpretation of 'feasible pairs' depends on other rules in the set. (2) There are frequently interactions between rules: whenever the lexical/surface pair affected by a rule A appears in the context of another rule B, the grammarian must check that its appearance in rule B will not conflict with the requirements of rule A. (3) Contexts may conflict: the same lexical character may obligatorily have multiple realisations in different contexts, but it may be impossible to state the contexts in ways that do not block a desired application. (4) Restriction to single character changes: whenever a change affecting more than one adjacent character occurs, multiple rules must be written. At best this prompts the interaction problem, and at worst can require the rules to be formulated with under-restrictive contexts to avoid mutual blocking. (5) There is no mechanism for relating particular rules to specific classes of morpheme. This has to be achieved indirectly by introducing special abstract triggering characters in lexical representations. This is clumsy, and sometimes descriptively inadequate (Trost, 1990).

Some of these problems can be alleviated by the use of a rule compiler that detects conflicts such as that described in (Karttunen and Beesley, 1992). Others could be overcome by simple extensions to the formalism. But several of these problems arise from the interpretation of Koskenniemi rules: each rule corresponds to a transducer, and the two-level description of a language consists of the intersection of these transducers. Thus somehow or other it must be arranged that every rule accepts every two-level correspondence. We refer to this class of formalisms as 'parallel': every rule, in effect, is applied in parallel at each point in the input.

*Supported by SERC studentship no. 92313384.
^Supported by a Benefactors' Studentship from St John's College.



The partition formalism consists of two types 
of rules (defined in more detail below) which en- 
force optional or obligatory changes. The notion 
of well-formedness is defined via the notion of a 
'partition' of a sequence of lexical/surface corre- 
spondences. Informally, a partition is a valid anal- 
ysis if (i) every element of the partition is licensed 
by an optional rule, and (ii) no element of the 
partition violates an obligatory rule. 

We have found that this formalism has some 
practical advantages: (1) The rules are relatively 
independent of each other. (2) Their interpreta- 
tion is more familiar for linguists: each rule copes 
with a single correspondence: in general you don't 
have to worry about all other rules having to be 
compatible with it. (3) Multiple character changes 
are permitted (with some restrictions discussed 
below). (4) A category or term associated with 
each rule is required to unify with the affected 
morpheme, allowing for morpho-syntactic effects 
to be cleanly described. (5) There is a simple and 
efficient direct interpreter for the rule formalism. 

The partition formalism has been implemented in the European Commission's ALEP system for natural language engineering, distributed to over 30 sites. Descriptions of 9 EU languages are being developed. A version has also been implemented within SRI's Core Language Engine (Carter, 1995) and has been used to develop descriptions of English, French, Spanish, Polish, Swedish, and Korean morphology. An N-level extension of the formalism has also been developed by (Kiraz, 1994; Kiraz, 1996) and used to describe the morphology of Syriac and other Semitic languages, and by (Bowden and Kiraz, 1995) for error detection in nonconcatenative strings. This partition-based two-level formalism is thus a serious rival to the standard Koskenniemi notation.

However, until now, the Koskenniemi notation has had one clear advantage: it was known how to compile it into transducers, with all the consequent gains in efficiency and portability and with the ability to construct lexical transducers as in (Karttunen, 1994). This paper sets out to remedy that defect by describing a compilation algorithm for the partition-based two-level notation.

2 Definition of the Formalism 
2.1 Formal Definition 

We use $n$ tapes, where the first $N$ tapes are lexical and the remaining $M$ are surface, $n = N + M$. In practice, $M = 1$. We write $\Sigma_i$ for the alphabet of symbols used on tape $i$, and $\Sigma = (\Sigma_1 \cup \{\epsilon\}) \times \dots \times (\Sigma_n \cup \{\epsilon\})$, so that $\Sigma^*$ is the set of string-tuples representing possible contents of the $n$ tapes. A proper subset of regular $n$-relations have the property that they are expressible as the Cartesian product of $n$ regular languages, $R = R_1 \times \dots \times R_n$; we call such relations 'orthogonal'. (We present our definitions along the lines of (Kaplan and Kay, 1994).)

We use two regular operators: Intro and Sub. $\mathrm{Intro}_S L$ denotes the set of strings in $L$ into which elements of $S$ may be arbitrarily inserted, and $\mathrm{Sub}_{A,B} L$ denotes the set of strings in $L$ in which substrings that are in $B$ may be replaced by strings from $A$. Both operators map regular languages into regular languages, because they can be characterised by regular relations: over the alphabet $\Sigma$, $\mathrm{Intro}_S = (\mathrm{Id}_\Sigma \cup (\{\epsilon\} \times S))^*$, $\mathrm{Sub}_{A,B} = (\mathrm{Id}_\Sigma \cup (B \times A))^*$, where $\mathrm{Id}_L = \{(s,s) \mid s \in L\}$, the identity relation over $L$.
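As a rough illustration of the two operators, the following sketch works on finite sets of ordinary Python strings rather than on automata. The function names `in_intro` and `sub` are hypothetical, and the restriction to finite languages (and to non-empty elements of B) is an assumption made only to keep the example small.

```python
def in_intro(w, S, L):
    """Test whether w is in Intro_S(L) for a finite language L.

    w is in Intro_S(L) iff some u in L can be turned into w by
    inserting symbols of S at arbitrary positions.
    """
    def reachable_from(u):
        positions = {0}  # lengths of the prefixes of u matched so far
        for ch in w:
            nxt = set()
            for j in positions:
                if j < len(u) and u[j] == ch:
                    nxt.add(j + 1)  # ch is the next symbol of u
                if ch in S:
                    nxt.add(j)      # ch is an inserted symbol
            positions = nxt
        return len(u) in positions
    return any(reachable_from(u) for u in L)


def sub(A, B, L):
    """Enumerate Sub_{A,B}(L) for finite A, B, L (empty strings in B are skipped)."""
    out = set()

    def rewrite(u, i, acc):
        if i == len(u):
            out.add(acc)
            return
        rewrite(u, i + 1, acc + u[i])      # identity step on one symbol
        for b in B:
            if b and u.startswith(b, i):   # replace this occurrence of b
                for a in A:
                    rewrite(u, i + len(b), acc + a)

    for u in L:
        rewrite(u, 0, "")
    return out
```

For instance, `in_intro("a0b0", {"0"}, {"ab"})` holds because deleting the 0s leaves `ab`, and `sub({"w"}, {"ww"}, {"awwb"})` contains both the unchanged string and the string with `ww` collapsed to `w`.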

There are two kinds of two-level rules. The context restriction, or optional, rules consist of a left context $l$, a centre $c$, and a right context $r$. Surface coercion, or obligatory, rules require the centre to be split into lexical $c_l$ and surface $c_s$ components.

Definition 2.1 An N:M context restriction (CR) rule is a triple $(l, c, r)$ where $l, c, r$ are 'orthogonal' regular relations of the form $l = l_1 \times \dots \times l_n$, $c = c_1 \times \dots \times c_n$, $r = r_1 \times \dots \times r_n$. □

Definition 2.2 An N:M surface coercion (SC) rule is a quadruple $(l, c_l, c_s, r)$ where $l$ and $r$ are 'orthogonal' regular relations of the form $l = l_1 \times \dots \times l_n$, $r = r_1 \times \dots \times r_n$, and $c_l$ and $c_s$ are 'orthogonal' regular relations restricting only the lexical and surface tapes, respectively, of the form $c_l = c_1 \times \dots \times c_N \times \Sigma_{N+1}^* \times \dots \times \Sigma_{N+M}^*$, $c_s = \Sigma_1^* \times \dots \times \Sigma_N^* \times c_{N+1} \times \dots \times c_{N+M}$. □



We usually use the following notation for rules:

    LLC  Lex   RLC   {=>, <=, <=>}
    LSC  Surf  RSC

where

    LLC (left lexical context)   = $(l_1, \dots, l_N)$
    Lex (lexical form)           = $(c_1, \dots, c_N)$
    RLC (right lexical context)  = $(r_1, \dots, r_N)$
    LSC (left surface context)   = $(l_{N+1}, \dots, l_{N+M})$
    Surf (surface form)          = $(c_{N+1}, \dots, c_{N+M})$
    RSC (right surface context)  = $(r_{N+1}, \dots, r_{N+M})$

Since in practice all the left contexts $l_i$ start with $\Sigma_i^*$ and all the right contexts $r_i$ end with $\Sigma_i^*$, we omit writing it and assume it by default. The operators are: => for CR rules, <= for SC rules and <=> for composite rules.

A proposed morphological analysis $P$ is an $n$-tuple of strings, and the rules are interpreted as applying to a section of this analysis in context: $P = P_l P_c P_r$ ($n$-way concatenation of a left context, centre, and right context). Formally:

Definition 2.3 A CR rule $(l, c, r)$ contextually allows $(P_l, P_c, P_r)$ iff $P_l \in l$, $P_r \in r$ and $P_c \in c$. □

Definition 2.4 An SC rule $(l, c_l, c_s, r)$ coercively disallows $(P_l, P_c, P_r)$ iff $P_l \in l$, $P_r \in r$, $P_c \in c_l$ and $P_c \notin c_s$. □

Definition 2.5 An N:M two-level grammar is a pair $(R^{\Rightarrow}, R^{\Leftarrow})$, where $R^{\Rightarrow}$ is a set of N:M context restriction rules and $R^{\Leftarrow}$ is a set of N:M surface coercion rules. □

Definition 2.6 A two-level grammar $(R^{\Rightarrow}, R^{\Leftarrow})$ accepts the string-tuple $P$, partitioned as $P_1, \dots, P_k$, iff $P = P_1 P_2 \dots P_k$ ($n$-way concatenation) and (1) for each $i$ there is a CR rule $A \in R^{\Rightarrow}$ such that $A$ contextually allows $(P_1 \dots P_{i-1}, P_i, P_{i+1} \dots P_k)$ and (2) there are no $i \le j$ such that there is an SC rule $B \in R^{\Leftarrow}$ such that $B$ coercively disallows $(P_1 \dots P_{i-1}, P_i \dots P_{j-1}, P_j \dots P_k)$.

There are some alternatives to condition (2):

(2i) there is no $i$ such that there is an SC rule $B \in R^{\Leftarrow}$ such that $B$ coercively disallows $(P_1 \dots P_{i-1}, P_i, P_{i+1} \dots P_k)$: this is (2) with the restriction $j = i + 1$; since SC rules can only apply to the partitions $P_i$, epenthetic rules such as $(\Sigma^*(k,k),\ \epsilon \times \Sigma_2^*,\ \Sigma_1^* \times a,\ (k,k)\Sigma^*)$ ('insert an a between lexical and surface ks') can not be enforced: the rule would disallow adjacent $(k,k)$s only if they were separated by an empty partition: $\dots(k,k), \epsilon, (k,k)\dots$ would be disallowed, but $\dots(k,k), (k,k)\dots$ would be accepted.

(2ii) there is no $i$ such that there is an SC rule $B \in R^{\Leftarrow}$ such that $B$ coercively disallows $(P_1 \dots P_{i-1}, P_i, P_{i+1} \dots P_k)$ or $B$ coercively disallows $(P_1 \dots P_{i-1}, \epsilon, P_i \dots P_k)$: this is (2) with the restriction $j = i + 1$ or $j = i$; this allows epenthetic rules to be used but may in certain cases be counterintuitive for the user when insertion rules are used. For example, the rule $(\Sigma^*(g,g),\ u \times \Sigma_2^*,\ \Sigma_1^* \times v,\ \Sigma^*)$ ('change u to v after a g') would not disallow a string-tuple partitioned as $\dots(g,g), (\epsilon,\epsilon), (u,u)\dots$ - assuming some CR rule allows $(\epsilon,\epsilon)$.

Earlier versions of the partition formalism could 
not (in practice) cope with multiple lexical char- 
acters in SC rules - see (Carter, 1995, §4.1). This 
is not the case here. 

The following rules illustrate the formalism:

    R1:  V  B  *  =>
         V  b  *

    R2:  B  B  *  =>
         b  b  *

    R3:  c  ε  d  <=>
         c  b  d

R1 and R2 illustrate the iterative application of rules on strings: they sanction the lexical-surface strings (VBBB,Vbbb), where the second (B,b) pair serves as the centre of the first application of R2 and as the left context of the second application of the same rule. R3 is an epenthetic rule which also demonstrates centres of unequal length. (We assume that (V,V), (c,c) and (d,d) are sanctioned by other identity rules.)

The conditions in Definitions 2.1 and 2.2 that restrict the regular relations in the rules to being 'orthogonal' are required in order for the final language to be regular, because Definition 2.6 involves an implicit intersection of rule contexts, and we know that the intersection of regular relations is not in general regular.
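Definition 2.6 can be read directly as a (slow) acceptance test on a given partition. The sketch below is only an illustration under several assumptions: there are two tapes, each orthogonal relation is written as one Python regular expression per tape, the default `Σ*` at the outer edge of each context is spelled out as `.*`, and the identity rules for V, c and d are added by hand. The names `accepts`, `member`, `concat`, `CR` and `SC` are hypothetical. The `variant` argument switches between condition (2) and the alternative (2i).

```python
import re

N_TAPES = 2  # one lexical tape, one surface tape


def concat(parts):
    """n-way concatenation of a list of string-tuples."""
    return tuple("".join(p[t] for p in parts) for t in range(N_TAPES))


def member(tup, rel):
    """Membership in an orthogonal relation given as one regex per tape."""
    return all(re.fullmatch(rx, s) is not None for rx, s in zip(rel, tup))


def accepts(cr_rules, sc_rules, parts, variant="2"):
    """Definition 2.6 with condition (2), or with the alternative (2i)."""
    k = len(parts)
    # Condition (1): each partition is contextually allowed by some CR rule.
    for i in range(k):
        left, right = concat(parts[:i]), concat(parts[i + 1:])
        if not any(member(left, l) and member(parts[i], c) and member(right, r)
                   for (l, c, r) in cr_rules):
            return False
    # Condition (2): no run of partitions i..j-1 (possibly empty) is
    # coercively disallowed by an SC rule.
    for i in range(k + 1):
        for j in range(i, k + 1):
            if variant == "2i" and j != i + 1:
                continue  # (2i) only inspects single partitions
            left = concat(parts[:i])
            centre = concat(parts[i:j])
            right = concat(parts[j:])
            for (l, c_lex, c_surf, r) in sc_rules:
                if (member(left, l) and member(right, r)
                        and member(centre, c_lex)
                        and not member(centre, c_surf)):
                    return False
    return True


ANY = (".*", ".*")
CR = [
    ((".*V", ".*V"), ("B", "b"), ANY),            # R1
    ((".*B", ".*b"), ("B", "b"), ANY),            # R2
    ((".*c", ".*c"), ("", "b"), ("d.*", "d.*")),  # R3, => direction
    (ANY, ("V", "V"), ANY),                       # assumed identity rules
    (ANY, ("c", "c"), ANY),
    (ANY, ("d", "d"), ANY),
]
SC = [
    # R3, <= direction: lexical centre is empty, surface centre must be b
    ((".*c", ".*c"), ("", ".*"), (".*", "b"), ("d.*", "d.*")),
]
```

Under these assumptions, the partition (V,V), (B,b), (B,b), (B,b) is accepted, as is (c,c), (ε,b), (d,d). The partition (c,c), (d,d) is rejected under condition (2), because the empty run between the two partitions violates the coercion part of R3, but it is accepted under (2i), which is the weakness of (2i) for epenthetic rules described above.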

2.2 Regular Expressions for Compilation 

To compile a two-level grammar into an automaton we use a calculus of regular languages. We first use the standard technique of converting regular $n$-relations into same-length regular relations by padding them with a space symbol 0. Unlike arbitrary regular $n$-relations, same-length regular relations are closed under intersection and complementation, because a theorem tells us that they correspond to regular languages over ($\epsilon$-free) $n$-tuples of symbols (Kaplan and Kay, 1994, p. 342).

A proposed morphological analysis $P = P_1 \dots P_k$ can be represented as a same-length string-tuple $\omega \hat{P}_1 \omega \hat{P}_2 \omega \dots \omega \hat{P}_k \omega$, where $\hat{P}_i \in \Sigma^*$ is $P_i$ converted to a same-length string-tuple by padding with 0s, and $\omega = (\omega_1, \dots, \omega_n)$, where the $\{\omega_i\}$ are new symbols to indicate the partition boundaries, $\omega_i \notin \Sigma_i \cup \{0\}$.

Since in a partitioned string-tuple accepted by the grammar $(R^{\Rightarrow}, R^{\Leftarrow})$ each $P_i \in c$ for some CR rule $(l, c, r) \in R^{\Rightarrow}$, we can make this representation unique by defining a canonical way of converting each such possible centre $C$ into a same-length string-tuple $\hat{C}$. A simple way of doing this is to pad with 0s at the right making each string as long as the longest string in $C$: if $C = (p_1, \dots, p_n)$,

$$\hat{C} = (p_1 0^*, \dots, p_n 0^*) \cap \Sigma^* - \Sigma^*(0, \dots, 0) \qquad (1)$$
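The right-padding of eq. (1) is easy to state operationally. The following is a minimal sketch, assuming single-character symbols and the character `0` as the space symbol; the helper names `pad_right` and `as_symbol_tuples` are hypothetical.

```python
def pad_right(centre, pad="0"):
    """Pad every string of a centre tuple on the right to the longest length."""
    width = max(len(s) for s in centre)
    return tuple(s + pad * (width - len(s)) for s in centre)


def as_symbol_tuples(padded):
    """View a same-length string-tuple as a string of n-tuples of symbols."""
    return list(zip(*padded))
```

Because padding stops at the length of the longest string, the result never ends in an all-0 tuple, which is what the subtraction of $\Sigma^*(0, \dots, 0)$ in eq. (1) expresses. For example, `pad_right(("ab", "b"))` gives `("ab", "b0")`, which `as_symbol_tuples` presents as the symbol tuples (a,b) and (b,0).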

However, since we know the set of possible partitions - it is $\bigcup \{c \mid \exists l, r.\ (l, c, r) \in R^{\Rightarrow}\}$ - we can reduce the number of elements of $\Sigma$ in use, and hence simplify the calculations, by inserting the 0s in a more flexible manner: e.g., if $C = (ab, b)$, let $\hat{C} = (ab, 0b)$ rather than $\hat{C} = (ab, b0)$: assuming another rule requires us to use $(b, b)$ anyway, we only have to add $(a, 0)$ rather than $(a, b)$ and $(b, 0)$. The preprocessor could use simple heuristics to make such decisions. In any case, the padding of possible partitions carries over to the centres $c$ of CR rules: if $(l, c, r) \in R^{\Rightarrow}$, $\hat{c} = \{\hat{C} \mid C \in c\}$. Henceforth let $\pi$ be the set of elements of $\Sigma$ that appear in some 0-padded rule centre.

The contexts of all rules and the lexical and surface centres of SC rules must be converted into same-length regular $n$-relations by inserting 0s at all possible positions on each tape independently: if $x = x_1 \times \dots \times x_n$,

$$x^0 = (\mathrm{Intro}_{\{0\}} x_1 \times \dots \times \mathrm{Intro}_{\{0\}} x_n) \cap \pi^* \qquad (2)$$

Note the difference between this insertion of 0s everywhere, denoted $x^0$, and the canonical padding $\hat{c}$. Both require the 'orthogonality' condition in order for the intersection with $\pi^*$ to yield a regular language: inserting 0s into $(a,b)^*$ at all possible positions on each tape independently would give a non-regular relation, for example.
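Membership in $x^0$ can be checked tape by tape, which is where orthogonality is used. The sketch below is an illustration only: it assumes single-character symbols, represents each $x_i$ as a Python regular expression, and takes $\pi$ to be a small hand-written set of symbol tuples. The names `in_x0`, `PI` and `RIGHT_CONTEXT` are hypothetical.

```python
import re


def in_x0(tuples, x, pi, pad="0"):
    """Test whether a string of symbol tuples belongs to x^0 (eq. 2).

    x is an orthogonal relation given as one regular expression per tape;
    pi is the set of symbol tuples occurring in 0-padded rule centres.
    """
    if any(t not in pi for t in tuples):
        return False  # the intersection with pi*
    for tape, rx in enumerate(x):
        # Deleting the 0s on this tape undoes Intro_{0}.
        stripped = "".join(t[tape] for t in tuples if t[tape] != pad)
        if re.fullmatch(rx, stripped) is None:
            return False
    return True


PI = {("B", "b"), ("0", "b"), ("V", "V"), ("c", "c"), ("d", "d")}
RIGHT_CONTEXT = ("d.*", "d.*")  # a context that starts with d on both tapes
```

With these values, the string (d,d)(V,V) is in the padded context, (0,b)(d,d) is not (its surface tape reads `bd`), and (d,0) is excluded because that tuple is not in $\pi$.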

Now we derive a formula for the set of 0-padded and partitioned analysis strings accepted by the grammar $(R^{\Rightarrow}, R^{\Leftarrow})$: The set of 0-padded centres of context restriction rules is given by:

$$D = \{\hat{c} \mid \exists l, c, r.\ (l, c, r) \in R^{\Rightarrow}\} \qquad (3)$$

Here we assume that these centres are disjoint ($\forall c, d \in D.\ c = d \lor c \cap d = \emptyset$), because in practice each $c$ is a singleton set; however, there is an alternative derivation that does not require this.

We proceed subtractively, starting as an initial approximation with an arbitrary concatenation of the possible partitions, i.e. the centres of CR rules:

$$\omega (D \omega)^* \qquad (4)$$

From this we wish to subtract the set of strings containing a partition that is not allowed by any CR rule: We introduce a new placeholder symbol $\tau$, $\tau \notin \pi \cup \{\omega\}$, to represent the centre of a rule, so the set of possible contexts for a given centre $c \in D$ is given by:

$$\bigcup_{(l,c,r) \in R^{\Rightarrow}} l^0 \tau r^0 \qquad (5)$$

So the set of contexts in which the centre $c$ may not appear is the complement of this:

$$\pi^* \tau \pi^* - \bigcup_{(l,c,r) \in R^{\Rightarrow}} l^0 \tau r^0 \qquad (6)$$

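Now we can introduce the partition separator $\omega$ throughout, then substitute the centre itself, $\omega c \omega$, for its placeholder $\tau$ in order to derive an expression for the set of partitioned strings in which an instance of the centre $c$ appears in a context in which it is not allowed: [$\circ$ denotes composition]

$$\mathrm{Sub}_{\omega c \omega, \tau} \circ \mathrm{Intro}_{\{\omega\}} \Bigl( \pi^* \tau \pi^* - \bigcup_{(l,c,r) \in R^{\Rightarrow}} l^0 \tau r^0 \Bigr) \qquad (7)$$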

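If we subtract a term like this for each $c \in D$ from our initial approximation (eq. 4) then we have an expression for the set of strings allowed by the CR rules of the grammar:

$$\omega (D \omega)^* - \bigcup_{c \in D} \mathrm{Sub}_{\omega c \omega, \tau} \circ \mathrm{Intro}_{\{\omega\}} \Bigl( \pi^* \tau \pi^* - \bigcup_{(l,c,r) \in R^{\Rightarrow}} l^0 \tau r^0 \Bigr) \qquad (8)$$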



It remains to enforce the surface coercion rules $R^{\Leftarrow}$. For a given SC rule $(l, c_l, c_s, r) \in R^{\Leftarrow}$, a first approximation to the set of strings in which this rule is violated is given by:

$$\mathrm{Intro}_{\{\omega\}} \bigl( l^0 \omega (c_l^0 - c_s^0) \omega r^0 \bigr) \qquad (9)$$



Here $(c_l^0 - c_s^0)$ is the set of strings that match the lexical centre but do not match the surface centre. For part (2) of Definition 2.6 to apply this must equal the concatenation of 0 or more adjacent partitions, hence it has on each side of it the partition separator $\omega$, and the operator Intro introduces additional partition separators into the contexts and the centre. The only case not yet covered is where the centre matches 0 adjacent partitions ($i = j$ in part (2) of Definition 2.6). This can be dealt with by prefixing with the substitution operator $\mathrm{Sub}_{\omega, \omega\omega}$, so the set of strings in which one of the SC rules is violated is:

$$\bigcup_{(l,c_l,c_s,r) \in R^{\Leftarrow}} \mathrm{Sub}_{\omega, \omega\omega} \circ \mathrm{Intro}_{\{\omega\}} \bigl( l^0 \omega (c_l^0 - c_s^0) \omega r^0 \bigr) \qquad (10)$$



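We subtract this too from our approximation (eq. 8) in order to arrive at a formula for the set of 0-padded and partitioned strings that are accepted by the grammar:

$$S_0 = \omega (D \omega)^* - \bigcup_{c \in D} \mathrm{Sub}_{\omega c \omega, \tau} \circ \mathrm{Intro}_{\{\omega\}} \Bigl( \pi^* \tau \pi^* - \bigcup_{(l,c,r) \in R^{\Rightarrow}} l^0 \tau r^0 \Bigr) - \bigcup_{(l,c_l,c_s,r) \in R^{\Leftarrow}} \mathrm{Sub}_{\omega, \omega\omega} \circ \mathrm{Intro}_{\{\omega\}} \bigl( l^0 \omega (c_l^0 - c_s^0) \omega r^0 \bigr) \qquad (11)$$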



Finally, we can replace the partition separator $\omega$ and the space symbol 0 by $\epsilon$ to convert $S_0$ into a regular (but no longer same-length) relation $S$ that maps between lexical and surface representations, as in (Kaplan and Kay, 1994, p. 368).
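On a single string of $S_0$, this last step amounts to dropping the separators and the padding. The sketch below illustrates it under the assumption that an encoded analysis is a Python list whose items are either the separator (written here as the one-character string `ω`, standing for the whole tuple) or a tuple of single-character symbols; the name `erase` is hypothetical.

```python
OMEGA = "ω"   # partition separator, one item standing for the whole tuple
PAD = "0"     # space symbol


def erase(encoded, n_tapes=2):
    """Map one string of S_0 to the string-tuple it denotes in S."""
    tapes = [[] for _ in range(n_tapes)]
    for sym in encoded:
        if sym == OMEGA:
            continue                 # omega is replaced by epsilon
        for t in range(n_tapes):
            if sym[t] != PAD:        # 0 is replaced by epsilon
                tapes[t].append(sym[t])
    return tuple("".join(t) for t in tapes)
```

For example, the encoded analysis ω (c,c) ω (0,b) ω (d,d) ω maps to the lexical/surface pair (cd, cbd).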



3 Algorithm and Illustration 

This section goes through the compilation of the sample grammar in section 2.1 step by step.



3.1 Preprocessing 

Preprocessing involves making all expressions of equal length. Let $\Sigma_1 = \{V, B, c, d, 0\}$ and $\Sigma_2 = \{V, b, c, d, 0\}$ be the lexical and surface alphabets, respectively. We pad all centres with 0s (eq. 1), then compute the set of 0-padded centres (eq. 3):

$$D = \{(B,b), (0,b), (V,V), (c,c), (d,d)\} \qquad (12)$$

We also compute contexts (eq. 2). Uninstantiated contexts become

$$\mathrm{Intro}_{\{0\}}(\Sigma_1^*) \times \mathrm{Intro}_{\{0\}}(\Sigma_2^*) \qquad (13)$$

The right context of R3, for instance, becomes

$$\mathrm{Intro}_{\{0\}}(d\Sigma_1^*) \times \mathrm{Intro}_{\{0\}}(d\Sigma_2^*) \qquad (14)$$

3.2 Compilation into Automata 

The algorithm consists of three phases: (1) constructing an FSA which accepts the centres, (2) applying CR rules, and (3) forcing SC constraints.

The first approximation to the grammar (eq. 4) produces FSA1, which accepts all centres.
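The language of this first automaton is simple enough to test without building the automaton. The following sketch checks membership in $\omega (D \omega)^*$ for the set $D$ of eq. (12), assuming, as holds for the sample grammar, that every padded centre is a single symbol tuple. The separator is written as the one-character string `ω`, and the name `in_first_approximation` is hypothetical.

```python
OMEGA = "ω"
D = {("B", "b"), ("0", "b"), ("V", "V"), ("c", "c"), ("d", "d")}


def in_first_approximation(encoded):
    """Test membership in omega (D omega)* for single-symbol centres."""
    if not encoded or encoded[0] != OMEGA:
        return False
    body = encoded[1:]
    if len(body) % 2 != 0:
        return False
    # The body must alternate: centre, omega, centre, omega, ...
    return all(body[i] in D and body[i + 1] == OMEGA
               for i in range(0, len(body), 2))
```

Note that ω (B,b) ω passes this test even though no CR rule licenses (B,b) without a preceding V or B: contexts are not checked at this stage, which is why the later phases subtract from this approximation.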




Phase 2 deals with CR rules. We have two centres to process: (B,b) (R1 & R2) and (0,b) (R3). For each centre, we compute the set of invalid contexts in which the centre occurs (eq. 7). Then we subtract this from FSA1 (eq. 8), yielding FSA2.

[Figure: FSA2]

The third phase deals with SC rules: here the <= portion of R3. Firstly, we compute the set of strings in which R3 is violated (eq. 10). Secondly, we subtract the result from FSA2 (eq. 11), resulting in an automaton which only differs from FSA2 in that the edge from q$ to q0 is deleted.



4 Comparison with Previous 
Compilations 

This section points out the differences in compil- 
ing two-level rules in Koskenniemi's formalism on 
one hand, and the one presented here on the other. 

4.1 Overlapping Contexts 

One of the most important requirements of two-level rules is allowing the multiple applications of a rule on the same string. It is this requirement which makes the compilation procedures in the Koskenniemi formalism - described in (Kaplan and Kay, 1994) - inconvenient. 'The multiple application of a given rule', they state, 'will turn out to be the major source of difficulty in expressing rewriting rules in terms of regular relations and finite-state transducers' (p. 346). The same difficulty applies to two-level rules.

Consider R1 and R2 (§2.1), and $D = \{(V,V), (B,b)\}$. (Kaplan and Kay, 1994) express CR rules by the relation,¹

$$\mathrm{Restrict}(c, l, r) = \overline{\overline{\pi^* l}\, c\, \pi^*} \cap \overline{\pi^* c\, \overline{r \pi^*}} \qquad (15)$$

This expression 'does not allow for the possibility that the context substring of one application might overlap with the centre and context portions of a preceding one' (p. 371). They resolve this by using auxiliary symbols: (1) They introduce left and right context brackets, $<_k$ and $>_k$, for each context pair $l_k$ - $r_k$ of a specific centre which take the place of the contexts. (2) Then they ensure that each $<_k$ only occurs if its context $l_k$ has occurred, and each $>_k$ only occurs if followed by its context $r_k$. The automaton which results after compiling the two rules is:

[Figure not recoverable: automaton with auxiliary symbols]

Removing all auxiliary symbols results in:

[Figure not recoverable: automaton without auxiliary symbols]

Our algorithm produces this machine directly. Compiling Koskenniemi's formalism is complicated by its interpretation: rules apply to the entire input. A partition rule is concerned only with the part of the input that matches its centre.

¹ This expression is an expansion of Restrict in (Kaplan and Kay, 1994, p. 371).

4.2 Conditional Compilation

Compiling epenthetic rules in the Koskenniemi formalism requires special means; hence, the algorithm is conditional on the type of the rule (Kaplan and Kay, 1994, p. 374). This peculiarity, in the Koskenniemi formalism, is due to the dual interpretation of the 0 symbol in the parallel formalism: it is a genuine symbol in the alphabet, yet it acts as the empty string $\epsilon$ in two-level expressions. Note that it is the duty of the user to insert such symbols as appropriate (Karttunen and Beesley, 1992).

This duality does not hold in the partition formalism. The user can express lexical-surface pairs of unequal lengths. It is the duty of the rule compiler to ensure that all expressions are of equal length prior to compilation. With CR rules, this is done by padding zeros. With SC rules, however, the Intro operator accomplishes this task. There is a subtle, but important, difference here.

Consider rule R3 (§2.1). The 0-padded centre of the CR portion becomes (0,b). The SC portion, however, is computed by the expression

$$\mathrm{Intro}_{\{0\}}(\epsilon) \times \mathrm{Intro}_{\{0\}}(b) \qquad (16)$$

yielding automaton (a):

[Figure: automata (a) and (b), with transitions labelled <0,0>, <0,b> and Any]

If the centre of the SC portion had been padded with 0s, the centre would have been

$$\mathrm{Intro}_{\{0\}}(0) \times \mathrm{Intro}_{\{0\}}(b) \qquad (17)$$

yielding the undesired automaton (b). Both are similar except that state $q_0$ is final in the former. Taking (a) as the centre, eq. 10 includes (cd,cd); hence, eq. 11 excludes it. The compilation of our rules is not conditional; it is general enough to cope with all sorts of rules, epenthetic or not.



5 Conclusion and Future Work 

This paper showed how to compile the partition 
formalism into N-tape automata. Apart from in- 
creased efficiency and portability of implementa- 
tions, this result also enables us to more easily 
relate this formalism to others in the field, using 
the finite-state calculus to describe the relations 
implemented by the rule compiler. 

A small-scale prototype of the algorithm has 
been implemented in Prolog. The rule compiler 
makes use of a finite-state calculus library which 
allows the user to compile regular expressions into 
automata. The regular expression language in- 
cludes standard operators in addition to the op- 
erators defined here. The system has been tested 
with a number of hypothetical rule sets (to test 
the integrity of the algorithm) and linguistically 
motivated morphological grammars which make 
use of multiple tapes. Compiling realistic descrip- 
tions would need a more efficient implementation 
in a more suitable language such as C/C++. 

Future work includes an extension to simulate 
a restricted form of unification between categories 
associated with rules and morphemes. 